In [156]:
%matplotlib inline
import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import seaborn as sns
sns.set_style("white")
In [79]:
ls -l ../data/output | grep -v "M.*\.csv"
In [57]:
conn = sqlite3.connect('../data/output/database.sqlite')
c = conn.cursor()
def execute(sql):
    '''
    Executes a SQL command on the 'c' cursor and returns the results
    '''
    c.execute(sql)
    return c.fetchall()
Apparently, SQLite is a serverless database (pretty cool, didn't know), so all I need to do is give it a file path and I can start making queries. I created an execute helper to save a couple of keystrokes and to make things easier if I ever change the cursor.
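As an alternative to the cursor wrapper, pandas can read query results straight into a DataFrame. A minimal sketch against an in-memory stand-in database (the rows here are made up for illustration; the real connection would use the sqlite file path above):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for '../data/output/database.sqlite'
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE Scorecard (id INTEGER, Year INTEGER)")
conn.executemany("INSERT INTO Scorecard VALUES (?, ?)",
                 [(1, 2012), (2, 2013), (3, 2013)])

# read_sql_query returns a DataFrame directly, no cursor bookkeeping needed
counts = pd.read_sql_query(
    "SELECT Year, Count(id) AS n FROM Scorecard GROUP BY Year ORDER BY Year",
    conn)
print(counts)
```

This avoids the tuple-of-tuples handling entirely, at the cost of pulling everything into memory at once.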
In [157]:
def printByYear(data):
    '''
    Given a list of tuples with (year, data), prints the data next to the corresponding year
    '''
    for datum in data:
        print "{0}: {1}".format(datum[0], datum[1])
In [58]:
# Print all tables in the database
tables = execute("SELECT Name FROM sqlite_master WHERE type='table'")
for table in tables:
    print table
There is one table and it is called Scorecard.
In [82]:
# Print the number of rows in the database
rowCount = execute("SELECT Count(id) RowCount from Scorecard")
rowCount = rowCount[0][0]
print "Row count:", rowCount
In [158]:
# Number of institutions in the dataset per year
rowCountByYear = execute("""SELECT Year, Count(id)
FROM Scorecard
GROUP BY Year""")
printByYear(rowCountByYear)
In [60]:
# Print the number of columns
fields = execute("PRAGMA table_info(Scorecard)")
print len(fields)
There are too many columns to analyze the data composition from the above query alone. To understand what is included in the dataset, I am looking through FullDataDocumentation.pdf, which covers the different categories of data and summarizes what is included in each. Below is a table of the items I found interesting.
All the fields in the database are listed in CollegeScorecardDataDictionary-09-12-2015.pdf along with their field names.
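With this many columns, scanning the PRAGMA output by eye is impractical; filtering the column names by keyword narrows things down. A sketch against a tiny stand-in table (the field names such as SAT_AVG are taken from the data dictionary and are assumptions about this sqlite export):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the Scorecard database
# A handful of columns standing in for the full schema
conn.execute("CREATE TABLE Scorecard "
             "(UNITID INTEGER, INSTNM TEXT, SAT_AVG REAL, ADM_RATE REAL)")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
cols = [row[1] for row in conn.execute("PRAGMA table_info(Scorecard)")]
sat_cols = [c for c in cols if 'SAT' in c.upper()]
print(sat_cols)
```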
Category | Field | Details
---|---|---
School | |
 | Name | Name of the school
 | Location | Longitude, latitude, city, etc.
 | Main campus or branch | 1 for main, 0 for branch
 | Type | Public, private for-profit, private non-profit
 | Revenue | Net tuition, instructional expenses, average faculty salary
 | Currently operating | 1 for yes, 0 for no
Academics | |
 | Programs offered | Field-of-study IDs
Admission | |
 | Admission rate for undergraduates | Rates for each branch or all branches
 | SAT and ACT scores |
Costs | |
 | Lots | See pages 6-7 of the data documentation
Student Body | |
 | Number of degree seekers |
 | Race statistics | Self-reported on FAFSA
 | Undergrads by family income | Percentages broken into categories based on expected family contribution on FAFSA
 | Retention rate |
 | Student body age | Percentage of students who are age 25-64
 | Parent education | Percentage of students whose parents have some level of education; collected from FAFSA
Financial Aid | |
 | Cumulative median debt |
 | Percent receiving Pell Grants | Federal grant that does not need to be repaid
 | Percent receiving federal loans |
Completion | |
 | Completion rate | Percent of students graduating within 150% and 200% of the expected time to graduate
 | Title IV students |
Earnings | |
 | Mean and median income | Available 6 and 10 years after graduation; data comes from W-2 tax forms
 | Threshold earnings | Percent of students earning more than people age 25-34 with only a high school diploma; a measure of whether the degree financially improved the student's outcome relative to not attending college
Repayment | |
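Once a few fields of interest are picked out of the table above, they can be pulled together in one query. A hedged sketch against an in-memory stand-in (the schema and values here are invented; the real column names would come from the data dictionary):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the Scorecard database
conn.execute("CREATE TABLE Scorecard (INSTNM TEXT, Year INTEGER, ADM_RATE REAL)")
conn.executemany("INSERT INTO Scorecard VALUES (?, ?, ?)",
                 [('A University', 2013, 0.9),
                  ('B College', 2013, 0.1),
                  ('C Institute', 2013, None)])  # NULLs are common in this data

# Most selective institutions first; NULL admission rates filtered out
rows = conn.execute("""SELECT INSTNM, ADM_RATE
                       FROM Scorecard
                       WHERE Year=2013 AND ADM_RATE IS NOT NULL
                       ORDER BY ADM_RATE""").fetchall()
print(rows)
```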
In [167]:
# Get locations of the universities
coordinates = execute("""SELECT Latitude, Longitude
FROM Scorecard
WHERE Latitude IS NOT NULL
AND Year=2013
AND main='Main campus'""")
In [168]:
print "Percent with location data (2013): {0: .2f}%".format(((len(coordinates)*1.0) / rowCountByYear[14][1]) * 100)
# Plot locations of institutions
df = pd.DataFrame()
def checkCoordinates(x, y):
    '''
    Keep only points inside a rough bounding box around the contiguous US
    '''
    return -128.19 <= x <= -65 and 24.19 <= y <= 49.62
df['x'] = [row[1] for row in coordinates if checkCoordinates(row[1], row[0])]
df['y'] = [row[0] for row in coordinates if checkCoordinates(row[1], row[0])]
locations = sns.regplot('x', 'y',
                        data=df,
                        fit_reg=False)
locations.set(title="Locations of Institutions", xticks=[], yticks=[], xlabel="", ylabel="")
sns.despine(left=True, bottom=True)
In [161]:
# Number of institutions that are main campuses by year
mainCampuses = execute("""SELECT Year, Count(id) mainCount
FROM Scorecard
WHERE main='Main campus'
GROUP BY Year""")
In [166]:
print "Number of total main campuses: {0}".format(sum([count[1] for count in mainCampuses]))
printByYear(mainCampuses)
print "\nMain campus percentages"
for i, count in enumerate(mainCampuses):
    print "{0}: {1: .2f}%".format(count[0], ((count[1]*1.0)/rowCountByYear[i][1])*100)
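Indexing `rowCountByYear[i]` by position assumes both result sets list the same years in the same order; joining on the year itself is safer. A small sketch with made-up counts standing in for the two query results:

```python
# Stand-ins for dict(rowCountByYear) and dict(mainCampuses)
totals = {2012: 100, 2013: 120}
mains = {2012: 60, 2013: 84}

# Percentage of main campuses per year, keyed by year instead of list position
pcts = {year: 100.0 * mains[year] / totals[year] for year in mains}
for year in sorted(pcts):
    print("{0}: {1:.2f}%".format(year, pcts[year]))
```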
I think a visualization with a picture of all the universities in the US would be cool. Maybe as the user adds certain constraints, they could see the dots become highlighted and unhighlighted. Along with that, there would be a bar on the right with the most prominent universities in those categories. It could also allow them to pin certain universities or institutions they were interested in.
As for a direction to explore, I kind of want to explore everything, or at least let the user explore everything.
In [ ]: